Method and apparatus for customizable surveillance of network interfaces

ABSTRACT

A method in a data processing system for monitoring for errors on a network. Responsive to detecting a change in a state of the network, a determination is made as to whether the change in state is a loss of a communications link to a remote data processing system. If the change in state is a loss of the communications link, a determination is made as to whether the communications link was established for at least a first period of time to be considered an acceptable connection to the remote data processing system. A new serviceable event is created if a second period of time passes without reestablishing the communications link to the remote data processing system. Repeat occurrences of identical outages are tracked, and multiple detected instances of an outage for different partitions are counted as a single failure in the examples.

BACKGROUND OF THE INVENTION

1. Technical Field

The present invention relates generally to an improved data processing system and in particular to a method and apparatus for processing data. Still more particularly, the present invention provides a method, apparatus, and computer instructions for customizable surveillance of network interfaces.

2. Description of Related Art

A logical partitioned (LPAR) functionality within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to be simultaneously run on a single data processing system platform. A partition, within which an operating system image runs, is assigned resources of the platform; some resources do not overlap between partitions, while other resources may be shared. In particular, global resources, such as power supplies, fans, and system backplanes, are shared across all of the partitions, while local resources, such as I/O adapters and devices, are not shared between partitions. These platform allocable resources include one or more architecturally distinct processors with their interrupt management areas, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented by the platform's firmware to the OS image.

Each distinct OS or image of an OS running within the platform is protected from the others such that certain errors on one logical partition cannot affect the correct operation of any of the other partitions. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that an image cannot control any resources that have not been allocated to it. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform, in which some of these resources are shared and others are unshared.

With respect to hardware resources in an LPAR system, these resources are shared disjointly among the various partitions, with each partition appearing to be a stand-alone computer. These resources may include, for example, input/output (I/O) adapters, processors, and hard disk drives. Each partition within the LPAR system may be booted and shut down over and over without having to power-cycle the whole system.

With respect to reporting of errors that occur in logical partitioned data processing systems, or even in non-partitioned data processing systems, recoverable errors are reported through an “in-band” reporting system. The error reports are sent to another data processing system, such as a hardware management console, through a communications link, also referred to as a “connection”. The reporting of these errors allows service calls to be made for the data processing system reporting the error, if needed. These connections are typically made over a network, such as a local area network, a wide area network, an intranet, or even the Internet. Since the recoverable errors are reported through a network interface, knowing about failures in the error reporting path is extremely important. Presently available monitoring systems may report outages in a LAN before the LAN becomes operationally stable, in addition to reporting glitches in the LAN. As a result, undesirable false reporting may occur. False reports may cause a customer to turn off the monitoring system and be exposed to a real outage in the reporting path that goes undetected.

Therefore, it would be advantageous to have an improved method, apparatus, and computer instructions for monitoring for outages in error reporting paths.

SUMMARY OF THE INVENTION

The present invention provides a method in a data processing system for monitoring for errors on a network. Responsive to detecting a change in a state of the network, a determination is made as to whether the change in state is a loss of a communications link to a remote data processing system. If the change in state is a loss of the communications link, a determination is made as to whether the communications link was established for at least a first period of time to be considered an acceptable connection to the remote data processing system. A new serviceable event is created if a second period of time passes without reestablishing the communications link to the remote data processing system. Repeat occurrences of identical outages are tracked, and multiple detected instances of an outage for different partitions are counted as a single failure in the examples.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a block diagram of hardware components in a data processing system in which the present invention may be implemented;

FIG. 2 is a block diagram of an exemplary logical partitioned platform in which the present invention may be implemented;

FIG. 3 is a diagram of the logical partitioned multiprocessing server computer system of FIGS. 1 and 2 and a hardware management console in accordance with the present invention;

FIG. 4 is a diagram of components used in providing customizable surveillance in accordance with a preferred embodiment of the present invention; and

FIG. 5 is a flowchart of a surveillance process in accordance with a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

With reference now to the figures, and in particular with reference to FIG. 1, a block diagram of a data processing system in which the present invention may be implemented is depicted. Data processing system 100 may be a symmetric multiprocessor (SMP) system including a plurality of processors 101, 102, 103, and 104 connected to system bus 106. For example, data processing system 100 may be an IBM eServer, a product of International Business Machines Corporation in Armonk, N.Y., implemented as a server within a network. Alternatively, a single processor system may be employed. Also connected to system bus 106 is memory controller/cache 108, which provides an interface to a plurality of local memories 160-163. I/O bus bridge 110 is connected to system bus 106 and provides an interface to I/O bus 112. Memory controller/cache 108 and I/O bus bridge 110 may be integrated as depicted.

Data processing system 100 is a logical partitioned (LPAR) data processing system. Thus, data processing system 100 may have multiple heterogeneous operating systems (or multiple instances of a single operating system) running simultaneously. Each of these multiple operating systems may have any number of software programs executing within it. Data processing system 100 is logically partitioned such that different PCI I/O adapters 120-121, 128-129, and 136, graphics adapter 148, and hard disk adapter 149 may be assigned to different logical partitions. In this case, graphics adapter 148 provides a connection for a display device (not shown), while hard disk adapter 149 provides a connection to control hard disk 150.

Thus, for example, suppose data processing system 100 is divided into three logical partitions, P1, P2, and P3. Each of PCI I/O adapters 120-121, 128-129, 136, graphics adapter 148, hard disk adapter 149, each of host processors 101-104, and memory from local memories 160-163 is assigned to one of the three partitions. In these examples, memories 160-163 may take the form of dual in-line memory modules (DIMMs). DIMMs are not normally assigned on a per-DIMM basis to partitions. Instead, a partition will get a portion of the overall memory seen by the platform. For example, processor 101, some portion of memory from local memories 160-163, and I/O adapters 120, 128, and 129 may be assigned to logical partition P1; processors 102-103, some portion of memory from local memories 160-163, and PCI I/O adapters 121 and 136 may be assigned to partition P2; and processor 104, some portion of memory from local memories 160-163, graphics adapter 148, and hard disk adapter 149 may be assigned to logical partition P3.

Each operating system executing within data processing system 100 is assigned to a different logical partition. Thus, each operating system executing within data processing system 100 may access only those I/O units that are within its logical partition. Thus, for example, one instance of the Advanced Interactive Executive (AIX) operating system may be executing within partition P1, and a second instance (image) of the AIX operating system may be executing within partition P2. In this example, partition P2 also runs a non-windowed operating system. Depending on the particular implementation, the mechanism of the present invention may be used with other operating systems in which windowing is supported. For example, a Windows XP operating system may be operating within logical partition P3. Windows XP is a product and trademark of Microsoft Corporation of Redmond, Wash.

Peripheral component interconnect (PCI) host bridge 114, connected to I/O bus 112, provides an interface to PCI local bus 115. A number of PCI input/output adapters 120-121 may be connected to PCI bus 115 through PCI-to-PCI bridge 116, PCI bus 118, PCI bus 119, I/O slot 170, and I/O slot 171. PCI-to-PCI bridge 116 provides an interface to PCI bus 118 and PCI bus 119. PCI I/O adapters 120 and 121 are placed into I/O slots 170 and 171, respectively. Typical PCI bus implementations will support between four and eight I/O adapters (i.e., expansion slots for add-in connectors). Each PCI I/O adapter 120-121 provides an interface between data processing system 100 and input/output devices such as, for example, other network computers, which are clients to data processing system 100.

An additional PCI host bridge 122 provides an interface for an additional PCI bus 123. PCI bus 123 is connected to a plurality of PCI I/O adapters 128-129. PCI I/O adapters 128-129 may be connected to PCI bus 123 through PCI-to-PCI bridge 124, PCI bus 126, PCI bus 127, I/O slot 172, and I/O slot 173. PCI-to-PCI bridge 124 provides an interface to PCI bus 126 and PCI bus 127. PCI I/O adapters 128 and 129 are placed into I/O slots 172 and 173, respectively. In this manner, additional I/O devices, such as, for example, modems or network adapters, may be supported through each of PCI I/O adapters 128-129. In this manner, data processing system 100 allows connections to multiple network computers.

A memory mapped graphics adapter 148 inserted into I/O slot 174 may be connected to I/O bus 112 through PCI bus 144, PCI-to-PCI bridge 142, PCI bus 141, and PCI host bridge 140. Hard disk adapter 149 may be placed into I/O slot 175, which is connected to PCI bus 145. In turn, this bus is connected to PCI-to-PCI bridge 142, which is connected to PCI host bridge 140 by PCI bus 141.

A PCI host bridge 130 provides an interface for a PCI bus 131 to connect to I/O bus 112. PCI I/O adapter 136 is connected to I/O slot 176, which is connected to PCI-to-PCI bridge 132 by PCI bus 133. PCI-to-PCI bridge 132 is connected to PCI bus 131. This PCI bus also connects PCI host bridge 130 to the service processor mailbox interface and ISA bus access pass-through logic 194 and PCI-to-PCI bridge 132. Service processor mailbox interface and ISA bus access pass-through logic 194 forwards PCI accesses destined to the PCI/ISA bridge 193. NVRAM storage 192 is connected to the ISA bus 196. Service processor 135 is coupled to service processor mailbox interface and ISA bus access pass-through logic 194 through its local PCI bus 195. Service processor 135 is also connected to processors 101-104 via a plurality of JTAG/I²C busses 134. JTAG/I²C busses 134 are a combination of JTAG/scan busses (see IEEE 1149.1) and Philips I²C busses. Alternatively, however, JTAG/I²C busses 134 may be replaced by only Philips I²C busses or only JTAG/scan busses. All SP-ATTN signals of the host processors 101, 102, 103, and 104 are connected together to an interrupt input signal of the service processor. The service processor 135 has its own local memory 191 and has access to the hardware OP-panel 190.

When data processing system 100 is initially powered up, service processor 135 uses the JTAG/I²C busses 134 to interrogate the system (host) processors 101-104, memory controller/cache 108, and I/O bridge 110. At completion of this step, service processor 135 has an inventory and topology understanding of data processing system 100. Service processor 135 also executes Built-In Self-Tests (BISTs), Basic Assurance Tests (BATs), and memory tests on all elements found by interrogating the host processors 101-104, memory controller/cache 108, and I/O bridge 110. Any error information for failures detected during the BISTs, BATs, and memory tests is gathered and reported by service processor 135.

If a meaningful/valid configuration of system resources is still possible after taking out the elements found to be faulty during the BISTs, BATs, and memory tests, then data processing system 100 is allowed to proceed to load executable code into local (host) memories 160-163. Service processor 135 then releases host processors 101-104 for execution of the code loaded into local memories 160-163. While host processors 101-104 are executing code from respective operating systems within data processing system 100, service processor 135 enters a mode of monitoring and reporting errors. The types of items monitored by service processor 135 include, for example, the cooling fan speed and operation, thermal sensors, power supply regulators, and recoverable and non-recoverable errors reported by processors 101-104, local memories 160-163, and I/O bridge 110.

Service processor 135 is responsible for saving and reporting error information related to all the monitored items in data processing system 100. Service processor 135 also takes action based on the type of errors and defined thresholds. For example, service processor 135 may take note of excessive recoverable errors on a processor's cache memory and decide that this is predictive of a hard failure. Based on this determination, service processor 135 may mark that resource for deconfiguration during the current running session and future Initial Program Loads (IPLs). IPLs are also sometimes referred to as a “boot” or “bootstrap”.

Data processing system 100 may be implemented using various commercially available computer systems. For example, data processing system 100 may be implemented using an IBM eServer pSeries 690 Model 681 system available from International Business Machines Corporation. Such a system may support logical partitioning using an AIX operating system, which is also available from International Business Machines Corporation.

Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 1 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.

With reference now to FIG. 2, a block diagram of an exemplary logical partitioned platform is depicted in which the present invention may be implemented. The hardware in logical partitioned platform 200 may be implemented as, for example, data processing system 100 in FIG. 1. Logical partitioned platform 200 includes partitioned hardware 230, Open firmware 210, and operating systems 202-208. Operating systems 202-208 may be multiple copies of a single operating system or multiple heterogeneous operating systems simultaneously run on logical partitioned platform 200.

Partitioned hardware 230 includes a plurality of processors 232-238, a plurality of system memory units 240-246, a plurality of input/output (I/O) adapters 248-262, and a storage unit 270. Each of the processors 232-238, memory units 240-246, and I/O adapters 248-262 may be assigned to one of multiple partitions within logical partitioned platform 200, each of which corresponds to one of operating systems 202-208. NVRAM is divided among the partitions; it is not assigned to any one specific partition.

Open firmware 210 performs a number of functions and services for operating system images 202-208 to create and enforce the partitioning of logical partitioned platform 200. Firmware is “software” stored in a memory chip that holds its content without electrical power, such as, for example, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), and non-volatile random access memory (NVRAM).

Open firmware 210 provides the OS images 202-208 running in multiple logical partitions each with a virtual copy of a console and operator panel. The interface to the console is changed from an asynchronous teletype port device driver, as in the prior art, to a set of Open Firmware calls that emulate a port device driver. Open firmware 210 encapsulates the data from the various OS images onto a message stream that is transferred to a computer 280, known as a hardware management console.

Open firmware 210 includes system boot firmware. A mechanism built into each of processors 232-238 as an architected instruction allows system firmware 210 to execute at any time. Thus, system checkpoints may be immediately displayed on the operator panel window on hardware management console 280 and also immediately logged to non-volatile random access memory (NVRAM) even before the I/O path to these devices has been configured to accept any programmed input/output (PIO) accesses. Hardware management console 280 is connected directly to logical partitioned platform 200, as illustrated in FIG. 2, or is connected to logical partitioned platform 200 through a network, such as a local area network (LAN), a wide area network (WAN), an intranet, or even the Internet. Hardware management console 280 may be, for example, a desktop or laptop computer. Hardware management console 280 decodes the message stream and displays the information from the various OS images 202-208 in separate windows, at least one per OS image. Similarly, keyboard input information from the operator is packaged by the hardware management console and sent to logical partitioned platform 200, where it is decoded and delivered to the appropriate OS image via the open firmware 210 emulated port device driver associated with the then-active window on hardware management console 280.

With reference now to FIG. 3, a block diagram of the logical partitioned multiprocessing server computer system of FIGS. 1 and 2 and a hardware management console is depicted in accordance with a preferred embodiment of the present invention.

Data processing system 100 includes a plurality of operating system (OS) partitions 302, 304, 306, and 308. These partitions receive inputs from input/output (I/O) devices and from base hardware, which may be a power supply, a cooling supply, a fan, memory, and processors. Any one of multiple, different operating systems, such as AIX or LINUX, can be running in any partition. For example, AIX is shown in partitions 302 and 306, while LINUX is shown in partitions 304 and 308. Although four operating system partitions are shown, any number of partitions with any one of a variety of different operating systems may be utilized.

Each partition includes an error log and a manager. When an error occurs within a partition, the error is logged into the partition's error log. The manager formats the error information into a standard format and forwards the error information, in the form of an error event log entry, to hardware management console 326. For example, partition 302 includes error log 310 and resource monitor control (RMC) 312; partition 304 includes error log 314 and resource monitor control 316; partition 306 includes error log 318 and resource monitor control 320; and partition 308 includes error log 322 and resource monitor control 324. The resource monitor control is notified of errors and reports the errors to hardware management console 326.

The present invention provides an improved method, apparatus, and computer instructions for surveillance of network interfaces. This monitoring mechanism is especially suitable for monitoring connections used to report errors. The mechanism of the present invention filters out conditions such as those in which the network is not yet stable. The mechanism of the present invention checks to ensure that a connection is made within a certain time period of a partition becoming active. The mechanism of the present invention does not monitor the loss of a connection until the connection is considered acceptable, that is, until the connection has been present for some selected period of time. This parameter may be customized to fit particular network setups. Also, the mechanism of the present invention filters out network glitches. In these examples, a network glitch is any temporary loss of service in the network. These and other features of the present invention allow for a reduction or elimination of nuisance outage reports, such as those caused by LAN instability, that would cause a customer to turn off monitoring and be exposed to an occurrence of a real outage that goes undetected.

Turning next to FIG. 4, a diagram of components used in providing customizable surveillance is depicted in accordance with a preferred embodiment of the present invention. FIG. 4 provides an illustration of the data flow used in monitoring communications links, such as connections to partitions from a hardware management console.

Data processing system 400 includes partition 402 and partition 404. These partitions access hardware 406, which includes components such as a power supply, cooling fans, memory, I/O adapters, and processors. Recoverable errors 408 are stored in error log 410 and error log 412 through open firmware 414. Entries within error log 410 notify resource monitor control 416, and entries within error log 412 notify resource monitor control 418.

These errors are reported to hardware management console 420 through communications links 422 and 424. These communications links, also referred to as “connections”, are made over a network, such as a local area network. In these examples, each partition establishes a separate communications link with hardware management console 420. Service focal point 426 receives the reports of errors from resource monitor control 416 and resource monitor control 418 in partitions 402 and 404, respectively. This type of error reporting is referred to as “in-band” reporting. Another type of error reporting is referred to as “out-of-band” reporting. In out-of-band reporting, errors are reported to hardware management console 420 by service processor 428. These types of errors are fatal errors, and the connection is not made through a network, unlike the in-band reporting of recoverable errors by partition 402 and partition 404.

With respect to the recoverable errors reported through in-band reporting, these reports are received by service focal point 426 through resource monitor control 430. An appropriate service action event, such as service action event 432, is generated. This event is sent to service 434 by Service Agent gateway 436.

Service focal point 426 also monitors the state of communications links 422 and 424. In particular, the monitoring is for failures in in-band reporting paths, such as communications links 422 and 424. The monitoring provided filters out glitches in the network through which communications links 422 and 424 are routed. In these examples, communications links 422 and 424 each provide a single point of failure for partition 402 and partition 404, respectively, for reporting errors. Without these connections, the partitions are unable to report errors, and extended error data (EED) would not be reported to service focal point 426.

In the depicted examples, service resource monitor 438 monitors the network connections to each partition for failures in these network connections. Service resource monitor 438 detects when an established connection for a session, such as one between resource monitor control 416 and resource monitor control 430, goes down.

Turning now to FIG. 5, a flowchart of a surveillance process is depicted in accordance with a preferred embodiment of the present invention. The process illustrated in FIG. 5 may be implemented in a data processing system, such as hardware management console 420 in FIG. 4. In this example, the process is implemented in service resource monitor 438.

The process begins by detecting an event (step 500). One event that may be detected in step 500 is a local area network (LAN) state change for a connection to a particular partition. In these examples, a state change may be either the establishment of a communications link or the loss of this connection, assuming that a successful connection has occurred. If the event is a LAN state change, a determination is made as to whether surveillance has been defined or selected for this connection (step 502). Surveillance may be configured for a partition as part of the partition profile setup. Some or all of the partitions may be monitored. If surveillance has been defined for this connection, then a determination is made as to whether the state change is a loss of the LAN connection (step 504). If the LAN connection has been lost, a determination is made as to whether a connected timer “A” has expired (step 506). This step is used to ensure that the connection was established for a sufficient amount of time to be considered an acceptable connection. In other words, this time is used to determine whether a connection actually has been made. If not enough time has passed, the connection is not considered acceptable or one that is actually present. Connections that have lasted for shorter periods of time may indicate that the network is not operationally stable and should be filtered out, rather than generating a notification of a serviceable event. If the connected timer “A” has not expired, the process terminates.
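The acceptable-connection test of steps 504 and 506 can be illustrated with a short sketch. The following Python fragment is illustrative only; the class and method names (LinkSurveillance, link_established, link_lost) and the realization of timer “A” as a recorded timestamp compared against a threshold are assumptions, not details taken from the specification.

```python
import time

class LinkSurveillance:
    """Sketch of the acceptable-connection test (steps 504-506)."""

    def __init__(self, connected_time_a=300.0):
        self.connected_time_a = connected_time_a  # timer "A" threshold, in seconds
        self.connected_since = None               # when the link last came up

    def link_established(self):
        # Step 522: the state change is the establishment of the
        # connection, so connected timer "A" is (re)started.
        self.connected_since = time.monotonic()

    def link_lost(self) -> bool:
        """Return True only if the lost link was an acceptable connection."""
        up_long_enough = (
            self.connected_since is not None
            and time.monotonic() - self.connected_since >= self.connected_time_a
        )
        self.connected_since = None
        # Step 506: a link that was up for less than time "A" is treated
        # as network instability and filtered out (process terminates).
        return up_long_enough
```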

Otherwise, an outage timer “B” is started (step 508). This timer is used to determine whether the network change, for a connection that had been established long enough to be an acceptable connection, is a network glitch. If a new state change occurs before timer “B” expires, the process is started again and nothing is reported, with the last change being considered a network glitch.

A determination is then made as to whether the outage timer “B” has expired (step 510). The process continues to return to step 510 until this timer expires. This loop may be broken by a change in the state of the network, which causes the entire process to begin again. In fact, a state change in the network will interrupt the process in steps 500 through 510 illustrated in FIG. 5 and restart it at step 500.
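The interplay of steps 508 and 510 behaves like a cancellable countdown. The sketch below models it with Python's threading.Timer; the OutageTimer name and the callback structure are one plausible realization for illustration, not the patent's prescribed design.

```python
import threading

class OutageTimer:
    """Sketch of steps 508-510: time an outage, treating early recovery as a glitch."""

    def __init__(self, outage_time_b, on_real_outage):
        self.outage_time_b = outage_time_b    # timer "B" threshold, in seconds
        self.on_real_outage = on_real_outage  # invoked only if the outage persists
        self._timer = None

    def start(self):
        # Step 508: start (or restart) outage timer "B".
        self.cancel()
        self._timer = threading.Timer(self.outage_time_b, self.on_real_outage)
        self._timer.daemon = True
        self._timer.start()

    def cancel(self):
        # A new state change before expiry breaks the loop of step 510;
        # the loss is considered a network glitch and nothing is reported.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```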

If outage timer “B” expires, a determination is made as to whether a surveillance serviceable event (SE) was created in the last “C” amount of time (step 512). In this example, “C” is a window of time over which the process looks back. This step is used to determine whether the latest outage has the same root cause as a previous serviceable event. If a surveillance serviceable event has not been created in the last “C” amount of time, a new surveillance serviceable event is created (step 514), and Service Agent is notified of the event (step 516), with the process terminating thereafter.

Otherwise, a determination is made as to whether a different reference code is present for the surveillance serviceable event (step 518). If a different reference code is present for this event as compared to the prior surveillance serviceable event, the outage has a different root cause, and the process proceeds to step 514 to create a new surveillance serviceable event. If the reference code is the same, a determination is made as to whether the service event contains the partition in which the state change was detected (step 520). Step 520 allows for an implementation in which this process generates only a single report for a particular type of surveillance error, even when the error is reported from multiple partitions in the logical partitioned data processing system.

If the service event includes the partition, a duplicate counter in the existing managed object is incremented (step 528), with the process terminating thereafter. As used herein, the managed object is the error report for the partition under surveillance.

With reference again to step 520, if the service event does not contain this partition, the partition is added to the service event (step 524), with the process terminating thereafter. This step allows service events to be reported for multiple partitions for a particular type of service event. Turning back to step 500, if the event detected is a partition activation notification, the process proceeds to step 508 to start the outage timer. This event causes the process to proceed to step 510 to take into account the activation of a partition that should be monitored but for which a successful connection does not occur.
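Steps 512 through 528 amount to a look-back search over recently created serviceable events. The following sketch assumes an in-memory list of events, plain-string reference codes, and a hypothetical notify_service_agent helper standing in for the Service Agent notification of step 516; none of these names come from the specification.

```python
import time

class ServiceableEvent:
    """Sketch of a surveillance serviceable event (the managed object)."""

    def __init__(self, reference_code, partition):
        self.created = time.time()
        self.reference_code = reference_code  # identifies the root cause
        self.partitions = {partition}
        self.duplicate_count = 0

def notify_service_agent(event):
    # Stand-in for the Service Agent notification in step 516.
    print(f"new surveillance serviceable event: {event.reference_code}")

def handle_expired_outage(events, reference_code, partition, window_c):
    """Steps 512-528: dedupe against events created in the last "C" seconds."""
    now = time.time()
    for event in events:
        if now - event.created > window_c:
            continue  # step 512: outside the look-back window "C"
        if event.reference_code != reference_code:
            continue  # step 518: different root cause, so a new event is needed
        if partition in event.partitions:
            event.duplicate_count += 1       # step 528: count the repeat
        else:
            event.partitions.add(partition)  # step 524: add the partition
        return event
    event = ServiceableEvent(reference_code, partition)  # step 514
    events.append(event)
    notify_service_agent(event)                          # step 516
    return event
```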

With reference again to step 500, if the event is a partition deactivation notification, the timers are cleared (step 526), with the process terminating thereafter. The timers are cleared to prevent any service events from being generated. In this case, the user has powered down the partition. In other words, surveillance of the partition is turned off in response to the deactivation of the partition.

Turning back to step 504, if the LAN connection was not lost, then the connected timer “A” is started (step 522). In this case, the state change is the establishment of the connection. This timer is employed along with step 506 to filter out short-term network connections and to avoid the generation of multiple serviceable events if the network is restored only for a short period of time. All three of the timers “A”, “B”, and “C” described with reference to FIG. 5 have default values, but they may be set by the user to different periods of time, depending on the implementation, to reflect the stability of the monitored network.
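Since all three periods are user-settable, an implementation would naturally gather them into a single configuration object, as in the sketch below. The default values shown are illustrative placeholders; the specification leaves the actual defaults implementation specific.

```python
from dataclasses import dataclass

@dataclass
class SurveillanceConfig:
    connected_time_a: float = 300.0   # timer "A": minimum uptime for an acceptable connection
    outage_time_b: float = 120.0      # timer "B": how long an outage must persist to be reported
    dedup_window_c: float = 86_400.0  # window "C": look-back period for duplicate events

# A less stable network might warrant a longer glitch window, for example:
lenient = SurveillanceConfig(outage_time_b=600.0)
```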

The reference codes generated for surveillance serviceable events are implementation specific. The mechanism of the present invention also provides an interface in the form of a graphical user interface (GUI) to allow an operator to enable or disable surveillance based on the criticality of reporting errors for that partition. Surveillance also may be enabled or disabled in other instances, such as when development partitions are running. These partitions may be brought up and down at different times. The parameters for the timers may be tailored to a particular installation and its network reliability characteristics.

Thus, the mechanism of the present invention provides an improved method, apparatus, and computer instructions for monitoring for failures in connections. Glitches in the network are filtered out, along with making sure that the network is operationally stable. In other words, the connection has to be present for some selected period of time before the connection is considered acceptable for monitoring. Repeat occurrences of identically caused outages are counted, while new occurrences of outages with different failure causes are reported.

The mechanism of the present invention detects a number of different types of outages, such as an outage due to a partition being down or a communication path to the partition being down. In these examples, an outage of surveillance on a partition is reported only if the partition is active and surveillance is enabled for the partition. Outages due to authentication failures may be detected with the present invention. Also, outages due to the resource monitor control on the hardware management console may be detected, as well as other types of outages.

This mechanism may be applied to types of paths other than those used for reporting errors. Further, the mechanism may be applied to any type of network or telecommunications network.

It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and in a variety of forms, and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, RAM, CD-ROMs, and DVD-ROMs. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.

The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

CLAIMS

1. A method in a data processing system for monitoring for errors on a network, the method comprising: responsive to detecting a change in a state of the network, determining whether the change in state is one of (i) a loss of a communications link to a remote data processing system and (ii) a re-establishment of the communications link to the remote data processing system; if the change in state is the re-establishment of the communications link, starting a connection timer, wherein the connection timer is used to filter out short-term establishment of the communications link; if the change in state is the loss of the communications link, determining whether the communications link was established for at least a first period of time to be considered an acceptable connection to the remote data processing system, wherein the acceptable connection is present upon expiration of the connection timer that was started in response to the change in the state of the network being the re-establishment of the communications link; and creating a new serviceable event if a second period of time passes without reestablishing the communications link to the remote data processing system.
2. The method of claim 1, further comprising: maintaining a duplicate counter that counts multiple duplicative state changes that occur within a third period of time.
3. The method of claim 1, further comprising: determining whether a prior serviceable event was created within a third period of time if the second period of time passes without reestablishing the communications link to the remote data processing system, wherein the new serviceable event is created if the prior serviceable event was not created within the third period of time.

4. The method of claim 1, further comprising: sending a notification of the new serviceable event to an administrator.
5. A method in a data processing system for monitoring for errors on a network, the method comprising: responsive to detecting a change in a state of the network, determining whether the change in state is a loss of a communications link to a remote data processing system; if the change in state is a loss of the communications link, determining whether the communications link was established for at least a first period of time to be considered an acceptable connection to the remote data processing system; creating a new serviceable event if a second period of time passes without reestablishing the communications link to the remote data processing system; determining whether a prior serviceable event was created within a third period of time if the second period of time passes without reestablishing the communications link to the remote data processing system; and if a prior serviceable event was created within the third period of time, adding an indication of a new outage to the prior serviceable event instead of creating the new serviceable event.
6. The method of claim 1, wherein the remote data processing system is a logical partitioned data processing system having a plurality of logical partitions, with each of the plurality of logical partitions (i) having at least one physical processor assigned thereto, and (ii) executing its own operating system which is a different instance from other operating systems executing on others of the plurality of logical partitions, and wherein the communications link is established by a partition in the logical partitioned data processing system.
7. The method of claim 1, wherein the first period of time and the second period of time are set by a user.

8. The method of claim 1, wherein the method is disabled when a partition is not active.

9. The method of claim 1, further comprising: creating the new serviceable event if a communications link is not established to the remote data processing system within a selected period of time.